In this portion of the workshop, we’ll now work on how to explore and visualize population structure using the same tools (DAPC, PCA, STRUCTURE, and a tree) in R using some RADseq SNP data.
First, let’s discuss the data and background a bit:
Monarch Butterfly (Danaus plexippus), (c) William Hemstrom 2022
Monarch Butterflies (Danaus plexippus) are a charismatic, migratory insect species originally from North America. Over the last ~200 years, they have travelled to a series of islands across the Pacific Ocean. They reached Hawaii in the 1840s, Australia in the 1870s, and in roughly 1900, they reached the Mariana Islands, including Guam and Rota. They are non-migratory almost everywhere in the western Pacific, although they secondarily re-acquired migratory behavior in southern Australia and New Zealand in the 1930s.
Expansion of monarch butterflies across the Pacific (Hemstrom et al in revision)
In North America, the migratory population can be roughly divided into two “populations” – one in western North America that spends the winter in California and the summer in the western United States, and one in eastern North America that winters in central Mexico and summers in the eastern United States.
Monarch migratory pathways in North America
Note that there are also native, non-migratory populations in the Caribbean and in Central and South America, including Colombia, where the first photo shown here was taken.
Located in the Data folder, there is a file called
monarchs.vcf. This is a compressed file in the Variant Call
Format (.vcf). The file format was originally described here
in 2011 and is now the dominant format used to store next-generation
sequencing data, from RAD-seq to whole-genome sequencing. The instructor
will show and describe the format for you.
This specific dataset contains genotypes for 251 samples at 50,000 SNP loci, a subset of the 500,000+ RADseq loci used for this paper. Those samples come from:
The metadata for these samples is stored in a file named
monarch_metadata.tsv in the Data folder.
Today, we are going to use this data to try and answer the following questions:
We’re going to use a couple of R packages to answer these questions. If you are new to R, don’t be concerned–between the tutorial we had this morning and the instructors, you’ll get through it!
The primary package we are going to use is called snpR.
It’s a SNP genomic analysis package that is designed to do a range of
basic genomic analyses, such as calculate observed and expected
heterozygosity, linkage disequilibrium, pairwise-\(F_{ST}\), and so on. It is designed for ease-of-use, particularly when analyses need to be split and run for multiple populations, families, sampling years, etc. It also has filtering and visualization tools, which we will be using here.
The other package we are going to use is adegenet, the
same package you used with the Wolf microsatellite data earlier.
First, we need to install snpR.
install.packages("remotes")
remotes::install_github("hemstrow/snpR")
Next, we need to load in the package:
library(snpR)
Next, we are going to read in our .vcf file, attaching
our sample metadata:
monarchs <- read_vcf("Data/monarchs.vcf", sample.meta = "Data/monarch_metadata.tsv")
We can request a report on our data by calling the monarchs object directly.
monarchs
This tells us a few things: the number of samples and loci, the average and minimum minor allele frequency (the frequency of the less common allele at each locus), the proportion of missing data, and some info on the metadata we have available to us.
Note: metadata categories in snpR are referred
to as facets, and we will use that language here as
well. For example, pop is the facet with
population info in our data, and has the levels described above.
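As a quick illustration of how facets are used, most snpR analysis functions accept a facets argument. This is a hedged sketch: calc_ho() and get.snpR.stats() are snpR functions, but check their help pages for the exact interface.

```r
# Sketch: compute observed heterozygosity separately for each population
# by passing the "pop" facet (exact interface per snpR's documentation).
monarchs <- calc_ho(monarchs, facets = "pop")

# retrieve the per-population results
get.snpR.stats(monarchs, facets = "pop", stats = "ho")
```

The same pattern works for other facets in the metadata: swapping "pop" for another column name splits the analysis along that grouping instead.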
Basically all genomic data is imperfect: genotyping errors,
missing data, linked loci, and so on are essentially ubiquitous. These
can all cloud our data and make it much harder to evaluate. To help
remove some of these, we’re going to apply some filters to our
data. However, because we are inevitably also going to accidentally
remove some good data, we should always compare our data with
different filters. snpR has a function to make this
easy: filter_snps().
There are a staggering number of metrics by which to filter genetic data, but a few of the most common post-genotyping approaches are:
MAF (Minor Allele Frequency) filtering: Removing SNPs for which the rarer (or “minor”) allele is below some specific frequency. Rare alleles can arise because of sequencing or alignment errors and can sometimes obscure population structuring estimation, but may also be biologically important or useful for detecting selection.
HWE filtering: Sequencing errors, paralogy, and a range of other undesirable causes can push loci out of HWE, and HWE is a common assumption of downstream models. However, biologically important processes that we may want to detect can also push loci out of HWE.
Genotyping coverage and missing data: We may want to remove individuals OR loci that are poorly sequenced, since large amounts of missing data can make those individuals/loci uninformative, and individuals/loci with large amounts of missing data may be more likely to have bad genotype calls where they are genotyped. However, throwing away too many individuals/loci can reduce statistical power, and abundant missing data is an unfortunate reality of most next-generation sequencing efforts.
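To make the first and third metrics concrete, here is a minimal base-R sketch of how MAF and per-locus missingness are computed, using a made-up toy genotype matrix rather than the monarch data:

```r
# Toy genotype matrix: rows = individuals, columns = loci.
# Genotypes coded as counts of one allele (0/1/2); NA = missing.
geno <- matrix(c(0, 1, 2, 0,
                 0, 0, 1, NA,
                 1, 0, 2, 0), nrow = 3, byrow = TRUE)

# frequency of the counted allele at each locus
freq <- colMeans(geno, na.rm = TRUE) / 2

# MAF: fold the frequencies so we always report the rarer allele
maf <- pmin(freq, 1 - freq)

# proportion of missing genotypes per locus
miss <- colMeans(is.na(geno))
```

Filtering then amounts to dropping columns where maf falls below a cutoff or miss rises above one, which is exactly what filter_snps() automates (with facet-aware bookkeeping) on real data.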
We can filter our data using snpR’s
filter_snps() function. We can set our filtering choices
using these arguments:
maf: set a minimum acceptable MAF.
maf_facets: accept loci that pass the MAF filter in any facet category.
mac or mgc: alternatives to MAF filtering; remove loci with a low Minor Allele Count (MAC) or that appear in few individuals (Minor-allele Genotype Count, MGC).
hwe: set a minimum acceptable HWE p-value. Loci more significantly out of HWE will be removed. We can use multiple testing correction (or “Family-Wise Error” correction, FWE) via the fwe_method argument. holm, the sequential Bonferroni approach, is a good, standard, and fairly conservative method of FWE correction.
hwe_facets: accept loci that pass our HWE filter in all facet categories.
min_ind: sets a minimum proportion of individuals that a locus must be sequenced in in order to be retained.
min_loci: sets a minimum proportion of loci that an individual must be sequenced at in order to be retained.
We’re going to look for structure with two different filters: a stringent (strict) filter and a non-stringent (relaxed) one.
The data you have will be our non-stringent filter; it was filtered with a relaxed set of thresholds applied across the pop facet.
For comparison, we’ll try a more stringent filtering regime, again applied across the pop facet.
You may wish to stop here and wait for the instructor
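As a sketch, a stringent filter might be applied like this. The thresholds below are illustrative placeholders, not the workshop’s actual strict settings – your instructor will provide those.

```r
# Illustrative thresholds only -- NOT the workshop's actual strict settings.
monarchs_strict <- filter_snps(monarchs,
                               maf = 0.05,          # assumed MAF cutoff
                               maf_facets = "pop",  # keep loci passing MAF in any pop
                               hwe = 1e-6,          # assumed HWE p-value cutoff
                               fwe_method = "holm", # sequential Bonferroni correction
                               hwe_facets = "pop",  # keep loci passing HWE in all pops
                               min_ind = 0.75,      # assumed: locus sequenced in >= 75% of individuals
                               min_loci = 0.75)     # assumed: individual sequenced at >= 75% of loci
```

Note how the maf_facets/hwe_facets arguments tie the filters to the pop facet, so loci are judged within populations rather than across the pooled dataset.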
First, we’re going to run a Principal Component Analysis. This is a
great first step in any genomic analysis after filtering, since it can
show you if you are missing population structure or have contaminated
samples. snpR contains the function
plot_clusters(), which will let you easily run and plot a
PCA with your data. We can use the facets argument to
automatically color by pop, one of the metadata columns
snpR told us was available earlier.
pca <- plot_clusters(monarchs,
facets = "pop")
The object returned is a (nested) list, with the plot stored under $plots$pca. We can call that (or the object as a whole) to see our plot.
pca$plots$pca
Is there population structure present in monarchs across the Pacific?
Based on the PCA structure results, do you have any hypotheses about the establishment pathway in the Pacific?
Run the PCA again, but use the strictly filtered dataset
monarchs_strict instead. Do any of your answers to the
questions above change? If so, why do you think they did?
Tip: if you are having trouble figuring out which color
is which, you can install and use the plotly package to
generate an interactive plot that you can mouse over for population
names
# install and load
install.packages("plotly")
library(plotly)
# make an interactive plot
ggplotly(pca$plots$pca)
DAPC is an alternative approach to PCA that assigns individuals to a
defined number of clusters using the same underlying PCA data. We’ll run
it using adegenet.
library(adegenet)
A DAPC can also be a great way to look for structure in your data if
you don’t have a good prior for population assignments or the number of
populations. This method makes use of the find.clusters
function within adegenet to identify the optimal number of
clusters (k) through K-means clustering. A measure of
goodness-of-fit is provided through Bayesian information criterion (BIC)
model selection.
In the code below, we will first run the find.clusters
command, which will prompt you to select the optimal number of clusters
(k). This is usually the inflection point where the BIC value
is the lowest. However, note that it is often beneficial to plot the
results of multiple k values, given that one model may not be
significantly better than the others.
Again, we choose the number of PCs to be 85-95% of the total number of individuals.
monarch_adeg <- format_snps(monarchs, "adegenet", facet = "pop") # format for adegenet
cluster <- find.clusters(monarch_adeg, max.n.clust = 20, n.pc = 170, stat = "BIC")
#Assign individuals to the groups determined in the command above
monarch_adeg$pop <- cluster$grp
#Run the DAPC on the assignments from "find.clusters"
monarch_adeg.dapc <- dapc(monarch_adeg, n.pca = 170, n.da = 2)
#Plot the results in PCA-space
scatter(monarch_adeg.dapc, scree.da = TRUE, scree.pca = TRUE, posi.pca = "bottomright", posi.da = "bottomleft", clab = 0)
#Produce a plot of individual membership to each of the clusters
compoplot(monarch_adeg.dapc, lab = "", txt.leg = levels(monarch_adeg.dapc$assign))
Let’s again find the optimal number of PCs to use in DAPC analysis, then re-do the DAPC analysis with that number of PCs.
opt.pca <- optim.a.score(monarch_adeg.dapc)
monarch_adeg.dapc2 <- dapc(monarch_adeg, n.pca = opt.pca$best, n.da=2)
scatter(monarch_adeg.dapc2, scree.da = TRUE, scree.pca = TRUE, posi.pca = "topleft", posi.da = "bottomleft", clab = 0)
compoplot(monarch_adeg.dapc2, lab = "", txt.leg = levels(monarch_adeg.dapc2$assign))
In this case, 57 principal components are deemed optimal to use.
Neither PCA nor DAPC is an explicitly genetic method–both are designed to explore any kind of high-dimensional quantitative data (although the specific versions we used here are optimized a bit for genetics). STRUCTURE, on the other hand, is designed from the ground up to use genetic data. Your instructor will stop here and talk about STRUCTURE and how it works.
You may wish to stop here and wait for the instructor
We can run STRUCTURE with snpR as well, using the function plot_structure(). This function is a bit more complicated–here are the new, more involved arguments that we will use:
k: the number of “populations” that we will assume exist and try to group individuals into. Usually, you should try a range of k values and see how the results vary.
iterations: the number of random MCMC runs to do – higher is better but slower.
burnin: the number of MCMC runs to do before results are tracked. This allows the algorithm to settle into a reasonable search area before it actually starts generating estimates. Higher is better to a point, but slower.
structure_path: the path on your computer to the non-graphical (console) version of STRUCTURE. This will vary, but is probably something like ~/Downloads/structure_windows_console/console/structure.exe if you are on Windows. Ask your instructor for help if needed!
facet.order: this will re-order the results for us to make them easier to interpret.
STRUCTURE is also really dang slow with large datasets, and doesn’t really need that many loci to do a great job. In this case, we’ll try running it with 3,000 loci by subsetting our data first using the usual [ operator.
# pick loci to keep randomly
set.seed(23521)
sub <- sample(nsnps(monarchs), 3000, replace = FALSE)
sub <- sort(sub)
# subset
sub_monarchs <- monarchs[sub,]
We can then run plot_structure(). This might take a few
minutes!
struct <- plot_structure(sub_monarchs,
facet = "pop",
method = "structure",
k = 4,
iterations = 1000,
burnin = 100,
facet.order = c("ENA", "WNA", "HAW",
"GUA", "SAI", "ROT",
"NSW", "QLD"),
structure_path = "/usr/bin/structure.exe")
If it seems like it is taking too long, you can hit the red STOP
button in the upper right hand side of your console to stop the run. The
instructor has prepared pre-run results for you if needed. They are
located at Data/monarch_non_strict_structure.RDS, and can
be read into R like this:
struct <- readRDS("Data/monarch_non_strict_structure.RDS")
The result is a nested list with the plot stored under
struct$plot.
struct$plot
Some notes: STRUCTURE’s results can vary from run to run; the clumpp_path argument can let you reconcile multiple runs automatically when plotting.
Answer the same questions you answered for PCA and DAPC based on the STRUCTURE results–are your answers different with STRUCTURE? If so, why do you think that might be?
Re-run STRUCTURE with the monarchs_strict data. You will need to generate a new sub object using the code above. A pre-run output is available at Data/monarch_strict_structure.RDS. Are your results different? If so, why do you think they might be?
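The subsetting step can be repeated for the strict dataset like so. This is a sketch reusing the code from above; sub_strict and sub_monarchs_strict are names introduced here for illustration.

```r
# pick 3,000 loci at random from the strictly filtered data
set.seed(23521)
sub_strict <- sample(nsnps(monarchs_strict), 3000, replace = FALSE)
sub_strict <- sort(sub_strict)

# subset with the usual [ operator
sub_monarchs_strict <- monarchs_strict[sub_strict, ]
```

You can then pass sub_monarchs_strict to plot_structure() exactly as before.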
You may wish to stop here and wait for the instructor
Lastly, we’ll make and plot a basic neighbor joining tree. We can use
the calc_tree() function in snpR to do this,
but we need to install a dependency, ape, first. You don’t
need to library() it!
install.packages("ape")
We can then run the tree:
tree <- calc_tree(monarchs_strict)
The labels of this tree will be a bit messy unless we fix them, which we can do using the sample metadata in the monarchs object. The labels are stored at tree$.base$.base$tip.label.
# fetch the metadata
meta <- sample.meta(monarchs)
# fix the labels.
tree$.base$.base$tip.label <- meta$pop
We can then use the plot() function to plot the tree. We
need to define colors based on the tip labels, which is a bit of a pain
but doable.
# define our colors, from the khroma package's batlow scale
colors <- c("#001959", "#113F5F", "#2C635B", "#617940",
"#A28A2D", "#E9995E", "#FDB1AB", "#F9CCF9")
# match colors to our data using factor magic -- talk to the instructor for info
tip_colors <- colors[as.numeric(as.factor(tree$.base$.base$tip.label))]
plot(tree$.base$.base, type = "unrooted", tip.color = tip_colors)
Answer the same questions you answered for the PCA/DAPC/STRUCTURE results–are your answers different with the tree? If so, why do you think that might be?
Re-run the tree with the monarchs_strict data. Are
your results different?
Now that we’ve run everything, which methods produced the most similar results? Any ideas about why that might be?
What were the overall differences between our stringently and non-stringently filtered datasets?
Taken together, how would you describe the population structure of the data? What does this suggest about establishment history?
Compare the amount of genetic distance between the samples from the Mariana Islands and North America. Which had more structure? Do you have any hypotheses that could explain your findings?